[ci_dcn_site] Add retry logic to Nova aggregate creation #3968

Open
tosky wants to merge 1 commit into openstack-k8s-operators:main from tosky:dcn_nova_aggregate_creation_retry

@tosky tosky commented May 29, 2026

Add retry logic (10 attempts, 30s delay) to the Nova aggregate creation task to handle transient MessageDeliveryFailure errors during RabbitMQ restarts or queue rebalancing.

This aligns with the existing defensive coding pattern used throughout the ci_dcn_site role, where similar k8s_exec and Kubernetes API operations already include retry logic (see pre-ceph.yml, post-ceph.yml, etc.).

Root cause: DataPlaneDeployment triggers RabbitMQ queue rebalance during DCN deployment, causing rolling restarts. Nova aggregate creation can fail with MessageDeliveryFailure if attempted during this window.

This patch provides reactive recovery through retries. Total retry time is up to 5 minutes (10 × 30s), which covers typical RabbitMQ restart windows observed in CI.
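
For illustration, the retry settings described above could be expressed on an Ansible task roughly as follows. This is a minimal sketch, not the actual diff: the task name, namespace, pod name, and the `_az_name` variable are assumptions made for the example; only the `retries: 10` and `delay: 30` values come from this description.

```yaml
# Hypothetical sketch; values marked "assumed" are illustrative, not taken from the role.
- name: Create Nova aggregate for the DCN site
  kubernetes.core.k8s_exec:
    namespace: openstack        # assumed namespace
    pod: openstackclient        # assumed pod name
    command: openstack aggregate create {{ _az_name }}  # _az_name is an assumed variable
  register: _aggregate_create
  # Re-run the task until the command succeeds, e.g. once RabbitMQ has settled.
  until: _aggregate_create is succeeded
  retries: 10   # from the PR description
  delay: 30     # seconds between tries; 10 x 30s = 300s of waiting at most
```

With `until` set, Ansible re-runs the task after each failure and waits `delay` seconds in between, so a transient MessageDeliveryFailure only fails the play if it persists past the whole retry window.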

Related-Issue: DCN deployment failure with MessageDeliveryFailure

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Luigi Toscano <ltoscano@redhat.com>

openshift-ci Bot commented May 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jokke-ilujo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
